Module 8 - Programming Assignment

This is the notebook for the Module 8 Programming Assignment.

Here are a few tips for using the iPython HTML notebook:

  1. You can use tab completion. Try le<tab> and see the available functions.
  2. You can change the type of cell by picking "Code" or "Markdown" from the menu at the left.
  3. If you keep typing in a Markdown text area, you will eventually get scroll bars. To prevent this, hit return when you come to the end of the window. Only a double return creates a new paragraph.
  4. You can find out more about Markdown text here: http://daringfireball.net/projects/markdown/ (Copy this link and put it in another tab for reference--Don't click it or you'll leave your notebook!).
  5. Every so often, restart the kernel, clear all output and run all code cells so you can be certain that you didn't define something out of order.

You should rename this notebook to be <your JHED id>_PR8.ipynb. Do it right now.

Make certain the entire notebook executes before you submit it. (See Tip #5 above.)

Change the following variables:


In [1]:
name = "Shehzad Qureshi"
jhed_id = "squresh6"
if name == "Student Name" or jhed_id == "sname1":
    raise Exception("Change the name and/or JHED ID...preferably to yours.")

Add whatever additional imports you require here. Stick with the standard libraries and those required by the class. The import gives you access to these functions: http://ipython.org/ipython-doc/stable/api/generated/IPython.core.display.html (Copy this link) Which, among other things, will permit you to display HTML as the result of evaluated code (see HTML() or display_html()).


In [2]:
from IPython.core.display import *
import numpy as np

For this assignment, you are going to implement both linear and logistic regression. For a starting point, you can refer to module-8-pseudocode.pdf. Remember, the only difference between the two is the actual specification of the error function. You will need some data. There are two files accompanying this assignment, linear_regression.csv and logistic_regression.csv.

The files are CSV files (comma separated values) and have the same basic layout: x1, x2, y with no headers. You can google how to read a CSV file. You should assume that the last value in each row will always be the value you need to predict but that there may be any number of explanatory (feature) values. You need to add the x0 == 1 (bias) term.

In the lecture, I mentioned that you usually should mean normalize your data but you don't need to do that in this case because the data is already mean normalized.
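For reference, here is a minimal sketch of what mean normalization would look like if it were needed (the function name is mine, not part of the assignment):

In [ ]:
def mean_normalize(features):
    # Center each column at 0 and scale it by its standard deviation.
    return (features - features.mean(axis=0)) / features.std(axis=0)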

I should mention that gradient descent is not the usual approach to (logistic) regression. There is a way to solve these exactly. The use of gradient descent is pedagogical, to prepare you for neural networks. Gradient descent is also used in cases where exact methods won't work.

The usual directions apply: fully document your functions (both what and why) and include a final discussion of the assignment and concepts.

Gradient Descent

Gradient descent is an algorithm similar to hill-climbing. The aim of the algorithm is to compute the weights of a line that best fits the dataset, and it does this by incrementally adjusting the weights to minimize the error.
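Each iteration nudges every weight a small step in the direction that reduces the error:

$\theta_j \leftarrow \theta_j - \alpha \frac{\partial J}{\partial \theta_j}$

where $\alpha$ is the learning rate.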

We first need a way to translate the input data. We assume the input data is a list of lists, with each inner list containing an example's features and its output, i.e. each list is of the form [x1 x2 x3 ... xn y]. Numpy's loadtxt function can parse CSV files into this format. To make the algorithm easier, we add x0 = 1 (the bias term) to the x vector.


In [3]:
def get_data(data):
    # Split off the last column as the output y; everything before it is a feature.
    x, y = np.hsplit(data, np.array([np.shape(data)[1]-1]))
    # Prepend the x0 = 1 bias term to every example.
    x = np.insert(x, 0, 1, axis=1)
    return x, y
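As a quick sanity check, here is get_data applied to a tiny made-up data set (the numbers are only for illustration):

In [ ]:
sample = np.array([[0.5, -0.2, 1.0],
                   [0.1,  0.3, 2.0]])
x, y = get_data(sample)
print x  # note the leading column of 1s (the bias term)
print y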

This is our function for computing the error for linear regression.

$J(\theta) = {1 \over 2n} \sum{(\hat{y} - y)}^2$

where

$\hat{y} = \theta \cdot x$

and $n$ is the number of examples in the data set. We square the difference in order to make the derivative a bit cleaner.
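Concretely, the ${1 \over 2}$ cancels when we differentiate with respect to $\theta_j$:

$\frac{\partial}{\partial \theta_j} {1 \over 2} (\hat{y} - y)^2 = (\hat{y} - y) \frac{\partial \hat{y}}{\partial \theta_j} = (\hat{y} - y) \, x_j$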


In [4]:
def calculate_linear_error(x, y, thetas):
    y_hat = np.dot(x, thetas).reshape(np.shape(y))
    e = np.sum((y_hat - y)**2) / (2 * np.shape(x)[0])
    return e

This is our function for computing the error for logistic regression.

$J(\theta) = - {1 \over n} \sum \left[ y \log(\hat{y}) + (1 - y)\log(1 - \hat{y}) \right]$

where

$\hat{y} = {1 \over {1 + e^{-\theta \cdot x}}}$

and $n$ is the number of examples in the data set. Note that summing the element-wise products of two vectors is the same as taking their dot product.


In [5]:
def calculate_logistic_error(x, y, thetas):
    tmp = np.dot(x, thetas).reshape(np.shape(y))
    y_hat = 1 / (1 + np.exp(-tmp))
    e = np.sum(y*np.log(y_hat) + (1-y)*np.log(1-y_hat)) / -np.shape(x)[0]
    return e
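To illustrate that dot-product observation, the same error can be written entirely with dot products. This is only a sketch of an equivalent formulation (the helper name is mine), not a replacement for the function above:

In [ ]:
def calculate_logistic_error_dot(x, y, thetas):
    # Equivalent to calculate_logistic_error, using dot products on flattened vectors.
    y_hat = 1 / (1 + np.exp(-np.dot(x, thetas).reshape(np.shape(y))))
    yv, yhv = y.ravel(), y_hat.ravel()
    return -(np.dot(yv, np.log(yhv)) + np.dot(1 - yv, np.log(1 - yhv))) / np.shape(x)[0]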

Our functions for computing the derivatives follow. We have one for linear regression and one for logistic regression, which are essentially the same except for the definition of $\hat{y}$.

$\frac{\partial J} {\partial \theta_j} = {1 \over n} \sum (\hat{y} - y) \cdot x_j$

$n$ is the number of examples in the data set and $x_j$ is the entire column of data for each feature.


In [6]:
def linear_derivative(x, y, thetas):
    y_hat = np.dot(x, thetas).reshape(np.shape(y))
    # For each theta_j: 1/n * sum((y_hat - y) * x_j), where x[:, i] is the full column of feature i.
    d = np.array([np.dot(x[:, i], y_hat - y)[0]/np.shape(x)[0] for i in xrange(np.shape(x)[1])])
    return d

In [7]:
def logistic_derivative(x, y, thetas):
    # Same gradient as the linear case, except y_hat is the sigmoid of theta . x.
    tmp = np.dot(x, thetas).reshape(np.shape(y))
    y_hat = 1 / (1 + np.exp(-tmp))
    d = np.array([np.dot(x[:, i], y_hat - y)[0]/np.shape(x)[0] for i in xrange(np.shape(x)[1])])
    return d
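The per-feature loop above can also be collapsed into a single matrix product. Here is a sketch of an equivalent, fully vectorized gradient for the linear case (the function name is mine; the logistic version would only swap in the sigmoid $\hat{y}$):

In [ ]:
def linear_derivative_vectorized(x, y, thetas):
    # (1/n) * X^T (y_hat - y), computed for every theta_j at once.
    y_hat = np.dot(x, thetas).reshape(np.shape(y))
    return np.dot(x.T, y_hat - y).ravel() / np.shape(x)[0]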

The two functions below, one for linear regression and one for logistic regression, take in rows of data and return a list of coefficients or weights, the theta vector. The initial theta vector is randomly generated from [0, 1). The algorithm iterates until the error difference between iterations is less than $\epsilon$.

The functions are exactly the same with the exception of initial parameters and the calls to their respective error and derivative calculations.


In [8]:
def linear_regression(data, alpha=0.001, epsilon=0.00001):
    x, y = get_data(data)
    # Start from random weights in [0, 1).
    thetas = np.random.random(size=np.shape(x)[1])
    previous_error, current_error = 0, calculate_linear_error(x, y, thetas)
    # Step against the gradient until the error change between iterations drops below epsilon.
    while np.absolute(current_error - previous_error) > epsilon:
        new_thetas = thetas - alpha * linear_derivative(x, y, thetas)
        thetas, previous_error = new_thetas, current_error
        current_error = calculate_linear_error(x, y, thetas)
    return thetas

In [9]:
def logistic_regression(data, alpha=0.05, epsilon=0.00001):
    # Identical loop to linear_regression, but using the logistic error and derivative.
    x, y = get_data(data)
    thetas = np.random.random(size=np.shape(x)[1])
    previous_error, current_error = 0, calculate_logistic_error(x, y, thetas)
    while np.absolute(current_error - previous_error) > epsilon:
        new_thetas = thetas - alpha * logistic_derivative(x, y, thetas)
        thetas, previous_error = new_thetas, current_error
        current_error = calculate_logistic_error(x, y, thetas)
    return thetas

Testing

Let's start by defining a function to read in our CSV file. We can use Numpy's loadtxt function to do the parsing.


In [10]:
def read_file(filename):
    return np.loadtxt(filename, delimiter=',')

Now we can run our linear regression algorithm against the data set provided.


In [11]:
data = read_file('linear_regression.csv')
linear_regression(data)


Out[11]:
array([ 1.02645659,  1.95674332,  2.88966996])

The textbook states that there is a way to solve analytically for the weights or coefficients that minimize the loss. The formula uses matrices:

$w^* = (X^T \cdot X)^{-1} \cdot X^T \cdot y$


In [12]:
def solve_linear_regression(x, y):
    # Normal equation: w* = (X^T X)^-1 X^T y.
    xm, ym = np.asmatrix(x), np.asmatrix(y)
    s = xm.T.dot(xm).I.dot(xm.T).dot(ym)
    return s

In [13]:
data = read_file('linear_regression.csv')
x, y = get_data(data)
solve_linear_regression(x, y)


Out[13]:
matrix([[ 1.],
        [ 2.],
        [ 3.]])

We see that our results from gradient descent actually come pretty close. Now we test logistic regression.


In [14]:
data = read_file('logistic_regression.csv')
logistic_regression(data)


Out[14]:
array([ 0.42060328,  1.23945647,  2.0392467 ])

Let's write a function to generate random feature data on the interval [-1, 1).


In [15]:
def generate_random_data(num_samples, num_features):
    x = np.random.uniform(-1, 1, size=(num_samples, num_features))
    return x

Now we have to calculate the output given theta values.


In [16]:
def generate_linear_output(x, thetas):
    tmp = np.insert(x, 0, 1, axis=1)
    t = np.array(thetas)
    y = [np.dot(val, t) for val in tmp]
    return np.array(y).reshape(len(y), 1)

Do the same for logistic output, where the output is either 0 or 1.


In [17]:
def generate_logistic_output(x, thetas):
    tmp = np.insert(x, 0, 1, axis=1)
    t = np.array(thetas)
    # Compute theta . x for each example, then threshold the result to get a 0/1 label.
    ytmp = [np.dot(val, t) for val in tmp]
    y = [0 if val < 0.5 else 1 for val in ytmp]
    return np.array(y).reshape(len(y), 1)

Wrap it up nice and neat as if we just pulled it out of a file.


In [18]:
def create_data(x, y):
    return np.concatenate((x, y), axis=1)

Test linear regression again.


In [19]:
x = generate_random_data(100, 3)
y = generate_linear_output(x, [2, 3, 4, 5])
d = create_data(x, y)
thetas = linear_regression(d, alpha=0.05)
print "Gradient descent", thetas
xd, yd = get_data(d)
solve_linear_regression(xd, yd)


Gradient descent [ 2.00595486  2.96786018  3.99606762  4.96413892]
Out[19]:
matrix([[ 2.],
        [ 3.],
        [ 4.],
        [ 5.]])

Test logistic regression.


In [20]:
x = generate_random_data(1000, 3)
y = generate_logistic_output(x, [2, 3, 4, 5])
d = create_data(x, y)
thetas = logistic_regression(d)
print thetas


[ 1.80411588  3.39723195  4.44518601  5.83452669]

My Testing


Final Thoughts

I consider gradient descent to be an interesting application of hill-climbing. The algorithm starts off with random coefficients and then incrementally increases or decreases them to better fit the output data. The test involves calculating the error over the data set (mean squared error for linear regression) and comparing it to the previous iteration's value. The learning rate parameter scales each update and determines how fast the coefficients are incremented or decremented. Epsilon is the stopping criterion that determines when the algorithm halts.

Linear regression can be solved analytically as shown above, so it is easy to compare the algorithm's output to the true coefficients and the solver's output. Logistic regression, however, has no such exact solver, so it is much harder to tell whether the algorithm is working properly. Generally speaking, the more data points available to the algorithm, the better it does, but this isn't always the case. It turns out that trying to compute coefficients for classification (output rounded to 0 or 1) still leaves a lot of error.
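One rough way to check the logistic model, since there is no exact solver to compare against, is to measure classification accuracy on the generated data. Here is a sketch, reusing the x, y, and thetas variables from the logistic regression test above:

In [ ]:
xb = np.insert(x, 0, 1, axis=1)
# Predict 1 whenever the sigmoid of theta . x is at least 0.5.
predictions = (1 / (1 + np.exp(-np.dot(xb, thetas))) >= 0.5).astype(int)
print "Classification accuracy:", np.mean(predictions == y.ravel())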